# Multimodal speech understanding
## Ultravox v0.6 Qwen 3 32B (fixie-ai)
- **License:** MIT · **Task:** Audio-to-Text · **Library:** Transformers · **Languages:** multilingual
- 1,240 downloads · 0 likes

Ultravox is a large multimodal speech-language model that understands and processes speech input, with support for multiple languages and noisy environments.
## Ultravox v0.6 Gemma 3 27B (fixie-ai)
- **License:** MIT · **Task:** Audio-to-Text · **Library:** Transformers · **Languages:** multilingual
- 641 downloads · 2 likes

Ultravox is a multimodal large speech-language model that processes speech and text inputs simultaneously, making it well suited to speech-interaction scenarios.
## Ultravox v0.6 Llama 3.3 70B (fixie-ai)
- **License:** MIT · **Task:** Audio-to-Text · **Library:** Transformers · **Languages:** multilingual
- 196 downloads · 0 likes

Ultravox is a large multimodal speech-language model that combines a pre-trained large language model with a speech encoder, handling both speech and text inputs.
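Since the Ultravox checkpoints ship as Transformers models, inference follows the library's chat-style pipeline pattern. The sketch below shows how the inputs are assembled; the model ID, the `trust_remote_code` requirement, and the exact pipeline input keys are assumptions based on the Ultravox model cards, and the inference call itself is left commented out because it requires downloading the checkpoint.

```python
import numpy as np

# One second of silence at 16 kHz stands in for a real speech recording.
SAMPLE_RATE = 16_000
audio = np.zeros(SAMPLE_RATE, dtype=np.float32)

# Ultravox takes a chat-style prompt alongside raw audio: the model
# consumes the speech waveform directly rather than a transcript.
turns = [
    {"role": "system", "content": "You are a helpful voice assistant."},
]

# Assumed inference interface (not executed here; needs the checkpoint):
# from transformers import pipeline
# pipe = pipeline(model="fixie-ai/ultravox-v0_6-qwen-3-32b",
#                 trust_remote_code=True)
# reply = pipe({"audio": audio, "turns": turns,
#               "sampling_rate": SAMPLE_RATE}, max_new_tokens=30)

print(audio.shape[0], str(audio.dtype), len(turns))
```

The audio is passed as a raw float array with its sampling rate, so any decoder (librosa, soundfile, etc.) can be used upstream of the pipeline.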
## Ichigo Llama 3.1 S Instruct v0.3 Phase 2 (homebrewltd)
- **License:** Apache-2.0 · **Task:** Audio-to-Text · **Language:** English
- 16 downloads · 5 likes

The Ichigo-llama3s series natively understands both audio and text input. It is based on the Llama 3 architecture and uses WhisperVQ as the tokenizer for audio files.
## SpeechLLM 2B (skit-ai)
- **License:** Apache-2.0 · **Task:** Audio-to-Text · **Library:** Transformers · **Language:** English
- 237 downloads · 16 likes

SpeechLLM is a multimodal large language model trained to predict speaker-turn metadata in conversations: speech activity, transcribed text, and the speaker's gender, age, accent, and emotion.
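The per-turn metadata SpeechLLM predicts can be pictured as a small structured record. The field names and label values below are illustrative assumptions for the categories listed in the description, not the model's exact output schema.

```python
import json

# Illustrative shape of one predicted speaker-turn record; the field
# names and label sets are assumptions, not SpeechLLM's actual schema.
turn_metadata = {
    "speech_activity": True,
    "transcript": "yes, I can hear you now",
    "gender": "female",
    "age": "middle-aged",
    "accent": "american",
    "emotion": "neutral",
}

print(json.dumps(turn_metadata, indent=2))
```

Emitting the prediction as one record per speaker turn makes it easy to join the metadata back onto a diarized conversation log downstream.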